ABSTRACT

We have developed Lynx (http://lynx.ci.uchicago.edu), a
web-based database and a knowledge extraction engine,
supporting annotation and analysis
of experimental data and generation of weighted 
hypotheses on molecular mechanisms contributing 
to human phenotypes and disorders of interest. Its 
underlying knowledge base (LynxKB) integrates 
various classes of information from >35 public data- 
bases and private collections, as well as manually 
curated data from our group and collaborators. Lynx 
provides advanced search capabilities and a variety 
of algorithms for enrichment analysis and network- 
based gene prioritization to assist the user in 
extracting meaningful knowledge from LynxKB and 
experimental data, whereas its service-oriented 
architecture provides public access to LynxKB and 
its analytical tools via user-friendly web services 
and interfaces. 

INTRODUCTION 

Technological advances in genomics now allow us to 
produce biological data at unprecedented tera- and 
petabyte scales. The extraction of useful knowledge from 
these voluminous data sets critically depends on seamless 
integration of clinical, genomic and experimental
information with prior knowledge about genotype-phenotype
relationships accumulated in a plethora of databases. 
Furthermore, these large and complex integrated know- 
ledge bases should be accessible to search engines and
algorithms that drive efficient knowledge extraction,
advancing scientific insight and the development of bio- 
medical applications. 

To meet these challenges, we developed Lynx
(http://lynx.ci.uchicago.edu), a web-based database and a
knowledge extraction engine for annotation and analysis
of high-throughput biomedical data. The Lynx database was
designed specifically to support both discovery-based and 
hypothesis-based approaches to prediction of genetic 
factors and networks contributing to phenotypes of 
interest. This support is provided by integration
of large amounts of information (e.g. genomic data,
pathways and molecular interactions) from
public and private repositories, as well as the targeted
acquisition of phenotypic information and data describing
association of genetic factors with diseases, clinical
symptoms and phenotypic features. Lynx's advanced
search engines and a variety of algorithms for enrichment
analysis and network-based gene prioritization support
the extraction of meaningful knowledge from LynxKB
and experimental data provided by the users. Lynx also
enables formulation of weighted hypotheses regarding 
molecular mechanisms contributing to human phenotypes 
and disorders of interest. 

LYNX DESIGN AND COMPONENTS 

The Lynx database system has the following major com- 
ponents: (i) Integrated Lynx knowledge base (LynxKB); 
(ii) Knowledge extraction services currently available for 
LynxKB, including advanced search capabilities,
feature-based gene enrichment analysis and network-based gene
prioritization, which may be invoked via the Lynx REST
interface; and (iii) 'Web Interface', a user-friendly web
interface for accessing the annotations and analytical
tools.

*To whom correspondence should be addressed. Tel: +1 773 702-4960; Fax: +1 773 834-0505; Email: maltsev@uchicago.edu
Correspondence may also be addressed to Dinanath Sulakhe. Tel: +1 630 252 7856; Fax: +1 630 252 5676; Email: sulakhe@mcs.anl.gov
Present address:
Natalia Maltsev, Human Genetics Department, the University of Chicago, CLSC, E. 58th str. Chicago, IL, 60637, USA.
Dinanath Sulakhe, Computation Institute, Chicago, IL 60637, USA.

© The Author(s) 2013. Published by Oxford University Press.
Nucleic Acids Research, 2014, Vol. 42, Database issue

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial
re-use, please contact journals.permissions@oup.com

Lynx integrated knowledgebase 

LynxKB is a database integrating modeled data from >35 
databases and manually curated private collections 
(Table 1). These data are used for annotation and extrac- 
tion of knowledge from LynxKB via database queries 
or from experimental data provided by the user. An 
XML schema-driven annotation service supports annota- 
tions from the LynxKB as RESTful web services. 
Additionally, LynxKB contains a number of manually 
curated in-house data collections, including inter alia 
customized ontologies for early brain development and 
brain connectivity (developed in collaboration with 
Dr. Paciorkowski, University of Rochester), weighted col- 
lections of candidate genes provided by our clinical 
collaborators or extracted from Developmental Brain 
Disorders Database (DBDB) and other disease-related 
data sources such as AutDB (19), Schizophrenia Gene 
Resource (20), LisDB (https://lisdb.ci.uchicago.edu) and
Cancer Gene Index
(https://wiki.nci.nih.gov/display/cageneindex/caBIO).
Lynx also provides exclusive analytical access to the
text-mining data describing molecular interactions from
GeneWays (26). Integration of the data describing
clusters of transcription factor binding sites (28) and
enhancers (29), as provided by the Vista project, allows
one to factor information about non-coding genomic
signals into the Lynx predictions of genetic factors
involved in disorders of interest. Integrated structured
data from LynxKB are available for download in multiple
formats (e.g. XML, CSV, TXT, JSON) via a web-based user
interface and web services.



Table 1. Data types and resources integrated in LynxKB

Type of data       Source
-----------------  ---------------------------------------------------
Genomic            NCBI (1), Ensembl (2), UniGene (3), TRANSFAC[b] (4),
                   RefSeq (5)
Proteomic          BIND (6), BioGRID (7), HPRD (8), MINT (9),
                   UniProt (10), InterPro (11)
Pathways-related   KEGG (12), Reactome (13), NCI (14), BioCarta,
                   STRING[b] (15), TRANSPATH[b] (16),
                   Pathway Commons (17)
Disease-specific   OMIM, Disease ontology (18), AutDB (19), SZGR (20),
                   Cancer gene index, AGRE, DBDB[a], LisDB[a]
Phenotypic         OMIM, Human phenotype ontology (21),
                   customized ontologies[a]
Variations         Genomic association database (22), Database of
                   genomic variants (23), Human genomic mutation
                   database[b] (24), SLEP (25), NHGRI
Text-mining        GeneWays[a] (26), Diseases (University of
                   Copenhagen)
Pharmacogenomics   Comparative toxicogenomics database (CTD) (27)

[a] Customized and manually curated sources of information.
[b] Resources that are not displayed on the annotations page due to
proprietary license restrictions and/or are used exclusively in the
analytical pipelines.



Lynx data are available for download in a number of
ways: (i) 'LynxKB database dumps'. Because the public
data are available for download at the respective
sources and the complete integrated LynxKB is
prohibitively large, downloading its full content
may be impractical. However, any part of the public
data integrated into LynxKB is available for download in
the form of tab-delimited tables and database dumps on
request; (ii) all annotations and results of analysis in Lynx
are available for download in CSV format via the
'download' button displayed on every page; and (iii) any
Lynx object or set of objects, as well as the results of
annotation and analysis, may be downloaded using web
services in JSON and XML formats.

Lynx knowledge extraction engine 

Seamless integration of data, knowledge-extraction
services and integrative analysis in Lynx provides a
one-stop solution for generating weighted hypotheses
regarding the molecular mechanisms contributing to the
phenotypes of interest (Figure 1). Lynx supports multiple entry
points for annotation and analysis of individual objects
(e.g. genes, pathways, disorders) and batch queries. The
user can submit search-based queries to LynxKB or
experimental results to be analyzed by Lynx, such as the
results of next-generation sequencing (NGS), copy
number variation-based analyses or gene expression data
in the form of SNPs, genomic coordinates or gene lists, for
annotation or downstream analysis via the web user
interface and its integrated services.

Figure 1. A workflow of knowledge extraction in the Lynx database
where initial query genes are filtered interactively using annotations
or based on the results of enrichment analysis. Resulting gene sets
are ranked by the user according to his/her preferences and further
prioritized using network-based prioritization, assisting in the
prediction of molecular mechanisms contributing to the phenotype or
biological process of interest to the user.

Lynx provides the following knowledge extraction tools for
downstream analyses and annotations:

Advanced search 

The large-scale integration of biomedical data in Lynx
makes it possible to mine these data from a systems
perspective. Its search capabilities, based on Apache
Lucene (http://lucene.apache.org/), allow users to
generate highly selective data sets by filtering queries
to LynxKB on multiple parameters of interest (e.g.
phenotypes, pathways, keywords). Phrase queries, wildcard
queries and Boolean operators are supported for deeper
refinement of search results. As illustrated below in the
case study, users can start with broader searches based
on diseases, pathways or symptoms of interest and then
narrow down the results according to the parameters of
interest. Another important feature of the advanced
search functionality in Lynx is that the results of the
queries are presented in association with other relevant
annotations, such as genes, pathways, tissues and
phenotypes, to provide a comprehensive overview of an
object of interest. Together, these capabilities give
researchers an integrated view of the biological data of
interest.
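
The phrase, wildcard and Boolean query types mentioned above follow the Apache Lucene classic query syntax. The Python sketch below shows how such a query string could be composed. The field names (disease, phenotype, pathway) and the helper build_query are hypothetical; the text does not describe the fields of the Lynx index, so this illustrates the syntax only.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Compose a query string in Lucene classic query-parser syntax.

    Each argument is a sequence of (field, text) pairs. Field names are
    hypothetical and are NOT taken from the Lynx index.
    """
    def term(field, text):
        # Multi-word text is quoted so that it is matched as a phrase;
        # single tokens (including wildcards such as seiz*) stay bare.
        if " " in text:
            text = '"' + text.replace('"', r'\"') + '"'
        return f"{field}:{text}"

    groups = []
    if all_of:
        groups.append(" AND ".join(term(f, t) for f, t in all_of))
    if any_of:
        groups.append("(" + " OR ".join(term(f, t) for f, t in any_of) + ")")
    query = " AND ".join(groups)
    for f, t in none_of:
        query += f" NOT {term(f, t)}"
    return query.strip()


example = build_query(
    all_of=[("disease", "autism spectrum disorder"), ("phenotype", "seiz*")],
    none_of=[("pathway", "apoptosis")],
)
print(example)
```

Starting with only the disease clause and then adding the phenotype clause mirrors the broad-to-narrow refinement described in the text.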

Annotation services 

Lynx's XML schema-driven annotation service provides
annotations from the integrated database as RESTful web
services. Every query to LynxKB for an individual
object (e.g. a gene) or a batch query (e.g. a list of genes
or genomic coordinates) extracts all information relevant
to the query from LynxKB for the growing list of
annotations [e.g. gene function description (RefSeq),
associated pathways, diseases, clinical symptoms,
molecular interactions, toxicogenomic information, Gene
Ontology categories, tissues and other related
annotations] and displays it to the user according to
his/her preferences. Lynx provides detailed web
interfaces for single-gene or multiple-gene annotations
that give users a view of the functionality of the genes
of interest from several perspectives. All information
related to the objects is accessible via the user
interface and available for download in tab-delimited,
XML or JSON formats (web services).
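
As an illustration of how a client might consume a batch annotation service of this kind, the Python sketch below builds a request URL for a list of genes and flattens a JSON reply into tab-delimited text. Everything specific in it is an assumption: the path layout, the parameter names ids and format, the record fields and the placeholder base URL are hypothetical and are not the documented Lynx API.

```python
import csv
import io
import json
from urllib.parse import urlencode


def annotation_url(base_url, object_type, identifiers, fmt="json"):
    """Build a batch-annotation URL.

    The path layout and the 'ids'/'format' parameters are hypothetical.
    """
    query = urlencode({"ids": ",".join(identifiers), "format": fmt})
    return f"{base_url.rstrip('/')}/{object_type}?{query}"


def records_to_tsv(json_text, columns):
    """Flatten a JSON array of annotation records into tab-delimited text."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        row = []
        for col in columns:
            value = rec.get(col, "")
            # List-valued annotations (e.g. several pathways) are joined with ';'.
            row.append(";".join(value) if isinstance(value, list) else str(value))
        writer.writerow(row)
    return out.getvalue()


# Placeholder host and a hand-written reply; no network call is made.
url = annotation_url("https://example.org/api/", "gene", ["GABRB3", "DLG3"])
reply = '[{"symbol": "GABRB3", "pathways": ["P1", "P2"]}, {"symbol": "DLG3"}]'
table = records_to_tsv(reply, ["symbol", "pathways"])
```

Keeping URL construction separate from parsing means the same flattening step can be applied to replies saved from the web interface.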

Statistical enrichment analysis 

Lynx assists the user in formulating hypotheses
regarding the molecular mechanisms involved in the
phenomena under study by providing tools for enrichment
analysis and identification of functional categories
over-represented in the query data sets. Two singular
enrichment analysis algorithms, Bayes factor and
P-value estimates, are used in our pipeline for this
purpose [see Xie et al. (30) for a fuller description and
results of analysis]. Enrichment analysis in Lynx is based
on a large variety of features obtained from multiple
sources [e.g. associated pathways and diseases (Table 1),
various levels of resolution of Gene Ontology terms], as
well as customized brain development and brain
connectivity ontologies unique to the system,
symptom-level phenotypes and associated non-coding
signals (e.g. enhancers and clusters of transcription
factor binding sites). The results of the enrichment
analyses based on multiple categories of interest to the
user may be used for formulating a working hypothesis
regarding molecular mechanisms involved in phenomena of
interest. Lynx also supports contextual enrichment
analysis (e.g. against genes expressed in a particular
tissue or at a particular developmental stage), which may
substantially increase the accuracy of the results.
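
To make the idea of over-representation concrete, the Python sketch below computes a one-sided hypergeometric P-value for a single functional category. This is the textbook test for singular enrichment analysis and is offered as an assumption; the exact P-value and Bayes factor estimators used in Lynx are described in Xie et al. (30) and may differ. The function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(background_size, category_size, query_size, overlap):
    """One-sided hypergeometric P-value for over-representation.

    Returns P(X >= overlap), where X is the number of category members
    found in a random sample of `query_size` genes drawn without
    replacement from a background of `background_size` genes, of which
    `category_size` belong to the category.
    """
    upper = min(category_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(category_size, i)
        * comb(background_size - category_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers only: 3 of 3 query genes fall in a 3-gene category
# within a 10-gene background.
p_toy = enrichment_p_value(10, 3, 3, 3)
```

Contextual enrichment corresponds to changing the background: restricting `background_size` and `category_size` to genes expressed in the tissue or developmental stage of interest changes the resulting P-value.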

Network-based gene prioritization 

Gene prioritization proposes promising candidate genes
from a large set of genes or even from the entire genome
for a disease or phenotype of interest. For network-based
gene prioritization, Lynx integrates five network
propagation algorithms [simple random walk, heat
kernel diffusion (31), PageRank with priors (32), HITS
with priors (33) and K-step Markov (33)], using
STRING version 9.0 (15) as the underlying protein
interaction network, as initially suggested in PINTA
(34,35). To use known disease genes as input, the
algorithms were modified for Lynx by replacing the
continuous microarray expression data required by the
original PINTA implementation with binary data based on
seed genes associated with a disease or phenotype of
interest: a '1' is fed as an input for each seed gene,
whereas a '0' is assigned to all non-seed genes (36).
Additionally, these algorithms were modified to
accommodate a variety of weighted data types to be used
for gene prioritization, including ranked
gene-to-phenotype associations, weighted canonical
pathways, gene expression, NGS data and others.
Consequently, the propagation algorithms provide a ranked
list of novel and promising candidate genes based on the
signal propagated through the network, starting from the
binary data associated with disease-related genes in the
network.
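
The propagation idea can be sketched with one of the five listed algorithms, PageRank with priors, written here in plain Python as a power iteration over a small weighted network. This is a minimal illustration under stated assumptions, not the Lynx or PINTA implementation: the restart probability back_prob, the iteration count, the helper names and the four-node toy network are all hypothetical, and real runs would use the STRING network.

```python
def pagerank_with_priors(edges, seeds, back_prob=0.3, iterations=100):
    """Propagate seed signal over an undirected weighted network.

    edges: iterable of (gene_u, gene_v, weight) with positive weights.
    seeds: dict gene -> input value; use 1 for every seed gene to get the
           binary input described in the text (non-seed genes get 0).
    back_prob: probability of jumping back to the seeds (an assumed value).
    """
    neighbors = {}
    for u, v, w in edges:
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    nodes = sorted(neighbors)

    raw = {n: float(seeds.get(n, 0.0)) for n in nodes}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no seed gene is present in the network")
    prior = {n: raw[n] / total for n in nodes}

    score = dict(prior)
    for _ in range(iterations):
        nxt = {n: back_prob * prior[n] for n in nodes}
        for u in nodes:
            out_weight = sum(neighbors[u].values())
            for v, w in neighbors[u].items():
                nxt[v] += (1.0 - back_prob) * score[u] * w / out_weight
        score = nxt
    return score


def rank_candidates(scores, seeds):
    """Non-seed genes ordered by decreasing propagated score."""
    candidates = [g for g in scores if g not in seeds]
    return sorted(candidates, key=lambda g: scores[g], reverse=True)


# Toy chain network A - B - C - D with a single binary seed A.
toy_edges = [("A", "B", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]
toy_scores = pagerank_with_priors(toy_edges, {"A": 1})
toy_ranking = rank_candidates(toy_scores, {"A": 1})
```

Passing non-binary values in `seeds` gives the weighted variant mentioned in the text, in which ranked gene-to-phenotype associations or expression values replace the 0/1 input.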

CASE STUDY: IDENTIFICATION OF MOLECULAR 
MECHANISMS ASSOCIATED WITH SEIZURES 
IN AUTISM 

This case study illustrates the functionality of Lynx by
predicting genes and molecular mechanisms associated
with a particular symptom of autism (seizures) based on
various Lynx analyses, such as annotation, gene set
enrichment analysis and gene prioritization.

Autism spectrum disorders (ASD) are known to be
associated with an increased incidence of epilepsy and
of epileptiform discharges on electroencephalograms.
However, it is unknown whether epileptiform discharges
correlate with symptoms of ASD and what the
contributing molecular mechanisms are (37,38).

To formulate a weighted hypothesis regarding genes 
and molecular mechanisms potentially contributing to 
epilepsy in patients with autism, we have performed the 
following steps: 

Step 1: Lynx advanced search was used to perform a 'fuzzy'
search for autism candidate genes against the 'disease'
object. The search returned 483 genes associated with
autism by OMIM, AutDB and Disease Database from
the University of Copenhagen. These genes were further
filtered using 'seizures' as a fuzzy search term. The
resulting query returned 59 genes positively associated
with both autism and the seizures phenotype
(Supplementary Table S1).

Step 2: The enrichment analysis of these 59 genes
associated with both autism and seizures showed
over-representation of the functional categories
associated with synaptic transmission, ionotropic
glutamate receptor binding and voltage-gated sodium
channel activity, which are already known to be
associated with ASD and epileptic phenotypes.

Step 3: The 59 genes obtained in Step 1 can be ranked 
according to the strength of their association with 
autism, as suggested by AutDB or expert curation, 
or can be assigned a default score of '1' as shown in 
the use case. 

Step 4: The ranked set of genes from Step 3 was used as an 
input to the gene prioritization tool, based on the heat 
kernel-ranking algorithm (Supplementary Table S2). 
Default parameters were used to run the algorithm. 
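
Steps 1 and 3 above amount to intersecting annotation filters and then attaching an input score to each surviving gene. The Python sketch below shows that logic on placeholder data. It uses exact term matching rather than the fuzzy search used in Lynx, and the gene names, annotation terms and the helper filter_and_seed are hypothetical.

```python
def filter_and_seed(gene_annotations, required_terms, scores=None, default_score=1.0):
    """Keep genes annotated with every required term and assign input scores.

    gene_annotations: dict gene -> set of annotation terms.
    scores: optional dict gene -> association strength (e.g. from expert
            curation); genes without an entry receive `default_score`.
    """
    scores = scores or {}
    selected = sorted(
        gene
        for gene, terms in gene_annotations.items()
        if all(term in terms for term in required_terms)
    )
    return {gene: scores.get(gene, default_score) for gene in selected}


# Placeholder annotations, not data from LynxKB.
toy_annotations = {
    "GENE_A": {"autism", "seizures"},
    "GENE_B": {"autism"},
    "GENE_C": {"autism", "seizures"},
}
default_seeds = filter_and_seed(toy_annotations, ["autism", "seizures"])
weighted_seeds = filter_and_seed(
    toy_annotations, ["autism", "seizures"], scores={"GENE_C": 0.4}
)
```

The resulting dictionary has the shape expected as seed input by a network propagation routine in Step 4.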

Gene prioritization predicted an additional 31
high-scoring genes (P < 0.02) potentially
contributing to the epileptic phenotype in autistic patients
(Supplementary Table S3). A number of these genes
predicted by the network were recently found to be associated
with ASD and epileptic phenotypes, but are not yet included
in the AutDB and OMIM databases (and consequently in
LynxDB) as markers for ASD. These include DLG3,
discs, large homolog 3 (39); GAD1, glutamate
decarboxylase 1, brain type (40); DOCK8, dedicator of
cytokinesis 8 (41); GABRB3, GABA A receptor, beta 3 (42);
GLUD2, glutamate dehydrogenase 2 (43); and others (see
Supplementary Table S3 for more details). All results of
analyses are available for download in various formats via
the user interface or web services. A video and tutorial
describing this and other examples of using Lynx for
data annotation and analyses are available at the Lynx
Web site at http://lynx.ci.uchicago.edu/usecase.html.

SYSTEMS ARCHITECTURE 

Lynx is designed using a service-oriented architecture and
is implemented using JAX-RS and the Spring framework (44)
to provide the integrated data and analytical tools as
RESTful services (45). The integrated data are modeled
and represented as XML schemas and, using JAXB (46),
are automatically translated into Java objects that are
then used to encapsulate data from the MySQL
database. The resulting annotations and results of
analysis are delivered in XML, JSON or TXT format as
per the request. The project is being developed using the
Maven (http://maven.apache.org) multi-module
architecture, so that the various data access object (DAO)
modules, service modules and REST-resource modules are
independently implemented and reused where necessary
using Spring's dependency injection. The algorithms
involved in the analytical steps are implemented using
Java and required statistical packages (such as MATLAB,
which is used in network-based prioritization) and
integrated within the project as Maven modules.
The modular design allows us to maintain
'separation of concerns' within the complete project
without introducing any design or architecture-based
dependencies.
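
The delivery of one result in XML, JSON or TXT "as per the request" can be illustrated with a small format dispatcher. The sketch below is written in Python for brevity and is only an analogy to the Java/JAXB stack described above; the record fields, the element name annotation and the function render are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def render(record, fmt):
    """Serialize one flat annotation record in the requested format."""
    if fmt == "json":
        return json.dumps(record, sort_keys=True)
    if fmt == "xml":
        root = ET.Element("annotation")
        for key in sorted(record):
            ET.SubElement(root, key).text = str(record[key])
        return ET.tostring(root, encoding="unicode")
    if fmt == "txt":
        # One tab-separated key/value pair per line.
        return "\n".join(f"{key}\t{record[key]}" for key in sorted(record))
    raise ValueError(f"unsupported format: {fmt}")


toy_record = {"symbol": "GABRB3", "score": 1}
as_json = render(toy_record, "json")
as_xml = render(toy_record, "xml")
as_txt = render(toy_record, "txt")
```

Keeping serialization in a single function, separate from data access, follows the same separation of concerns that the module layout above is meant to enforce.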

Data and analytical web services 

Although the integrated data and annotations, as well as
the various analytical tools, are presented to the users via
the web interface, the service-oriented architecture enables
other users and groups to leverage our work and integrate it
within their own research tools and platforms. For
example, the Globus Genomics project (47) at the
University of Chicago Computation Institute is working to
integrate the Lynx knowledge base annotation services and
analytical workflows (via web services) for analysis and
annotation of NGS results. The Developmental Brain Disorders
Database (https://www.dbdb.urmc.rochester.edu/home)
at the University of Rochester and RViewer (48) are
also using the Lynx RESTful web service interface for
annotation of genomic data. End users can download the data
sets of interest and results of analysis from the web
interface.

CONCLUSIONS 

We present the Lynx database and knowledge extraction
suite of tools designed specifically to support both
discovery-based and hypothesis-based approaches to
identification of genetic factors contributing to
phenotypes or disorders of interest. Lynx integrates the
main downstream analyses, such as gene annotation, gene
set enrichment analysis and gene prioritization, within
one engine, based on a large knowledge base built from
public and private data and a powerful search engine that
gives the user access to the knowledge base through a
user-friendly web interface.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online.